NoSleep2024

Author

Rosemary Pang

Happy Halloween! This is the Halloween special for the Text as Data course. In this project, I scrape the NoSleep subreddit and analyze the horror stories people shared.

library(rvest)
library(stringi)
library(stringr)
library(dplyr)

Attaching package: 'dplyr'
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union
library(stm)
stm v1.3.7 successfully loaded. See ?stm for help. 
 Papers, resources, and other materials at structuraltopicmodel.com
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ forcats   1.0.0     ✔ readr     2.1.5
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter()         masks stats::filter()
✖ readr::guess_encoding() masks rvest::guess_encoding()
✖ dplyr::lag()            masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(quanteda)
Package version: 4.1.0
Unicode version: 14.0
ICU version: 71.1
Parallel computing: disabled
See https://quanteda.io for tutorials and examples.
library(quanteda.dictionaries)
library(quanteda.sentiment)

Attaching package: 'quanteda.sentiment'

The following object is masked from 'package:quanteda':

    data_dictionary_LSD2015
library(quanteda.textplots)
library(ggplot2)
library(wordcloud)
Loading required package: RColorBrewer
library(data.table)

Attaching package: 'data.table'

The following objects are masked from 'package:lubridate':

    hour, isoweek, mday, minute, month, quarter, second, wday, week,
    yday, year

The following object is masked from 'package:purrr':

    transpose

The following objects are masked from 'package:dplyr':

    between, first, last
library(text2vec)
library(cleanNLP)

Getting Data Ready

This section covers data cleaning and preprocessing. After reading in the data, I found that some stories have more than one part, marked as [Part 1], [Part 2], etc. I delete these markers before preprocessing.

NoSleep <- read_csv("NoSleep.csv") %>%
  select(-1)
New names:
• `` -> `...1`
Rows: 958 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (3): ...1, title, text
date (1): date_utc
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(NoSleep)
# A tibble: 6 × 3
  date_utc   title                                                         text 
  <date>     <chr>                                                         <chr>
1 2024-07-23 I thought my husband was the one licking my feet. I was wron… "My …
2 2024-07-23 The Road Trip Tapes- All roads lead to dead ends              "I d…
3 2024-07-23 Does the Mod Team know your Address as well?                  "Hey…
4 2024-07-23 Our Ocean                                                     "The…
5 2024-07-23 The light eyed ones                                           "*Be…
6 2024-07-23 I Boarded a Train to Nowhere                                  "I'v…
#Replace \n\n (paragraph breaks) with a space
NoSleep$text_clean <- NoSleep$text %>%
  str_replace_all("\n\n", " ") %>%
  str_replace_all("\n", " ") %>%
  str_replace_all("\\*", "") %>%
  str_replace_all("\\[.*?\\]", "") #Delete [Part 1][Part 2]

#Double checking: if there's no space between a period and a word, we add a space
NoSleep$text_clean <- gsub("([[:punct:]])(?=[A-Z])", "\\1 ", NoSleep$text_clean, perl = TRUE)

NoSleep$text_clean[1]
[1] "My husband had a foot fetish. He even told me about it on our first date. And despite the fact that I was never too into feet myself, I appreciated that he was upfront about it, and opened myself up to it over time. Not on the giving side, of course. I've always just personally found feet disgusting... ... But on the receiving side. Which worked out perfectly, because my husband's particular foot fetish, was licking toes. Yup, really getting in there with his tongue and slobbering all over each of my toes. One by one. Now, it took me years of exploring my husband's kink, to build up enough tolerance not to burst out laughing from the tickling sensation within 30 seconds. And after four years of dating, and nearly ten years of marriage, I eventually became something of a pro at getting my feet licked and maintaining my composure. In fact, I got so good at it, that whenever my husband would lick my toes, I could completely tune it out, often multitasking things like filing my nails, reading a book, and streaming a show, as he licked away. Heck, sometimes I'd even fall asleep while he did it. Which is why, on that fateful night in 2019, when I was lying in bed in the dark, my feet dangling off the mattress, and I began to feel a tongue slither its way up and down each and every one of my toes, I could not have been less bothered. Truth is, I probably wouldn't have even noticed that something was off, were it not for the fact that it felt as if my husband was trying a new technique. Something that he had never implemented before in the fourteen years that we had been together... ... The bite. Yes, after feeling him suck each and every one of my toes, I then proceeded to feel more intense pressure, as I felt his teeth begin to clamp down on my feet. \" Feeling adventurous?\" I called out in the dark, before shaming him, as he sometimes asked me to do. 
\" You're a naughty, naughty boy and you're gonna pay for that.\" I reached over to my bedside table, opened its drawer, and removed a pair of handcuffs that he sometimes liked me to use on him after he was finished with my feet. \" Don't make me restrain you.\" I called out again. But rather than play along, and flirt back from the foot of the bed as he always did, he suddenly stopped biting my feet. \" Honey?\" I asked \" Why'd you stop?\" But he didn't reply. For a moment, there was only silence... ... Until suddenly, I heard my husband's cell phone ringtone go off in the bathroom. RING. RING. \" You gonna get that?\" I asked. But he continued to remain quiet, ignoring my question. \" Fine, I'll get it.\" I said, as I hopped up from the bed and tiptoed through the dark, over to the bathroom, its door having been closed since I retired to the bedroom earlier that night. As I approached it, I couldn't help but notice that light was emanating from underneath the crack between the rug and the door. \" Honey, you left the light on in the bathroom again.\" I called back to the bed, as I began to open the bathroom door. \" I'm gonna have to punish you even more-\" But before I could finish my sentence, my jaw dropped and I began to shake... ... As I saw my husband's lifeless body just lying there in the shower, its limbs mangled, its skin pale, as whatever blood remained in it poured down the drain. I simply froze, as my mind began to race. If that's my husband. And he is dead. Then who the fuck was just licking feet? I thought to turn around, and stare back into the dark bedroom. But I was too terrified to see who might be standing behind me. So I opted to instead walk over to my husband's body, and remove his cell phone from his pocket, which was still ringing, over and over and over again. RING. RING. I looked down at the phone, as I held it in my hand, and saw an incoming call from an \" Unknown Number.\" I answered it. 
\" Hello?\" For a moment, there was silence... ... Until a quiet, raspy voice simply whispered back over the line. \" I'm under the bed. Want to play some more?\" Suddenly, chills ran down my spine and my body went into fight or flight, as my heart began racing and I began to hyperventilate. Not knowing what else to do, I gathered myself and in a single motion, turned around, darted into the bedroom and ran for the hallway, as I tried to slam the door behind me. But just before I was able to close it, I saw what looked like the silhouette of a man crawl out and block it from shutting, as he writhed on the ground like an insect. And as I pushed and pushed, trying to keep whoever or whatever was in my bedroom from getting out, I saw his body wriggle a few inches more into the hallway and suddenly felt... ... His tongue lick a single one of my toes. \" Aaaaaaaargh!!\" I screamed at the top of my lungs, before stumbling back, turning around and running down the stairs, as I saw whoever or whatever it was chase after me on his hands and knees. SLAM. I stood there, outside my house, leaning against the door, as my body shook and tears ran down my eyes, before eventually willing myself to run to my car. As I backed out of my driveway and peeled off down the street, I looked back at the door, only to find that he never opened it, for some reason choosing not to chase after me. Thirty minutes later, I was at the police station, reporting what had happened. And an hour after that, the cops were combing my house. But the strange figure who had killed my husband and licked my feet... ... Was gone. Five years later, I still can't fall asleep at night unless I'm under the covers, for fear of feeling a tongue lick my toes. And while I haven't stopped mourning my husband, if I ever date again, I think it's fair to say I could never date someone with a foot fetish again. I've tried to move past what happened. Tried to move past the shocking events of that fateful night. 
But whenever I tell someone my story and they bring up the urban legend of \" The Licked Hand,\" I tell them, it's no legend at all... ... And it licks feet too."
NoSleep$text_clean[5]
[1] "Beware the light eyed ones!  That's what  my grandma had always said to my mom,aunts and uncles and eventually to her grand kids and great grandchildren.  My grandfather always thought it was strange as well buy he died years before I was born,in a car crash.  My mother had told me that her mom had proudly told her and her siblings that she made sure they all had brown eyes once they were born, and even did the same to myself, all of my cousins, my younger brother,and my 3 younger sencond cousins  Apparently my grandma practically had to fight with the doctors a few times while trying to get to a few of us to make sure our eyes were brown or hazel.  Every one of us do,in fact have shades of brown eyes. From very dark brown, to medium, to light brown, to amber, to hazel eyes with a little green flecks.  She also threatened to disown one of her sons when he dated a green eyed girl when he was a teenager. She demanded that they all only dated people with brown eyes. Hazel with green flecks was the only exception to her.  She also told her kids that any friends they brought over were to have brown eyes as well . She always made sure us grandkids did the same.  Anytime she came across someone with blue,green or gray eyes,she would get a terrified look on her face and get away from them as fast as possible  She would never even go out to family restaurants unless she knew the waiter or waitress had dark eyes    No one ever truly knew why and whenever she was asked,she would just say the sane phrase every single time.  Beware the light eyed ones   That is, until I went to visit her on her death bed a few days ago. Multiple family members had come and gone to say their good byes to her already.  She was 86 when she passed. Looking at her frail body and pale skin while she laid on that bed  brought tears to my eyes. Besides her weird obsession with people having brown eyes and avoiding those who didn't like they were a disease,she was a great woman  .  
She always had fun stuff for us grandkids to do when we visited her and was otherwise delightful  She had not a single  racist or homophobic bone in her body. She just had something against light eyed people.  When I sat down beside her,she grabbed my hand as tight as she could and told me how much she loved me and that she had to tell me something that she couldn't even tell her husband,friends or even her own kids  I was surprised by this and happily agreed to listen to her get whatever she held back for so long off of her chest to me....but after listening to her story, I am terrified Here is what she said to me before dying 2 says later  As you know, I have never been fond of non-brown eyed people in my life. I know its been a mystery to so many of you for years now. And now that I am on my way out, I am finally able to be at peace knowing they can never get me.  When I was little ,almost everyone I knew had dark eyes in the town I lived in. There were a few that I knew who had light eyes but they were rare  One day on my way to school when I was in 3rd grade,a young man and  woman were just standing in my path,smiling down at me. They both looked to be around 25 to me   They both had very beautiful light eyes. The man had eyes as green as emeralds and the woman had eyes the color of a blue sky mixed with silvery gray. I smiled at them as I attempted to walk around them as I knew to not make small talk with strangers...yes we were taught that even back in those days But as I tried to walk around them, I felt hands grab me from behind as I was then thrown to the ground. Before I could even scream, I was put into a choke hold by the man as the woman stared down from above me...those smiles still plastered to their faces and their lifeless light eyes staring down at me as the world faded to black around me When I came to, I found myself tied to a tree and gagged, in front of the lake in my town, where I learned to swim. 
As if on que,the man and woman emerged from some other nearby trees and 4 more people followed behind them  They all stood in front of me with blank expressions on their faces and they didn't blink. I noticed something that they all had in coomon. They all had light eyes. The first man had green,the woman had grayish blue,another man had plain gray,another woman had baby blue,another man had greenish blue and the last woman had dark blue eyes  After a few moments the man who choked me out stepped foward and bend down so his eyes were in front of mine. He then pulled out a silver spoon from his pocket, and told me that I needed to get my new pretty eyes so they could have a child in the new world that he is starting  I still remember the fear I felt as he brought up the spoon inches from my left eye as the other 5 people simply smiled and watched. But just before the spoon touched my eye, I heard a gun shot. The man stopped,dropped the spoon and fell down to his side...dead  The other 5 people tried to run off but more gunshots rang out and  they too all dropped one by one. Several hunters were out that day hunting duck and  came across what those freaks were doing and knew they had to take action and began firing from a short distance  They ran out from their hiding spots and untied me and settled me down as much as my hysteric self could be. I was reunited with my parents and siblings later that day as we all cried together. My mom said she would never let me be again  One woman who was shot survived, and after being interrigated for hours,she admitted that they were a small group called the Light eyed ones. They wanted every person in the world to have non- brown eyes. And they would do anything to make that happen. 
They had stalked me for weeks and planned to kidnap me,take out my eyes and put in glass eyes with blue irisis to make me perfect  She was sent to a lonely house after she was deemed insane by doctors where I think she spent the rest of her days as far as I know  But not long after that happened as I was returning to normal life as best as I couuld, I found a note on my dresser one day that said the other light eyed ones are still around and would grow in numbers and would make me join them soon  I never told anyone and threw it away as I didn't want my parents to know and I was still just a little girl who didn't want them to freak out again. I was homeschooled after that day as I feared a light eyed one would appear in my school and kidnap me while the other kids watched in horror  That is why I've never trusted them....any of those damn light eyed ones. I have been on alert for many decades....years of nightmares them staring down at me with those cold light eyes....years of fear and paranoia. But at least it will be over soon  I was so shocked by her confession but I held back and so my hugged her as she began to sob after she finished her story.  As I left later that day, I came across a blue eyed nurse who smiled at me ...and I felt fear...and as I hurried past her I swear she mumbled under her breath  You will join the light eyed ones  one day"
#We first need to clean the text. For example, some authors use \n\n to separate paragraphs, which does not render in the text and causes run-ons like "fetish.He". So we add a space after any punctuation followed directly by a capital letter.
#NoSleep$cleaned_text <- gsub("([[:punct:]])(?=[A-Z])", "\\1 ", NoSleep$text, perl = TRUE)
#double-check
#writeLines(NoSleep$cleaned_text[1])

We only look into the main content of the stories. Now for data pre-processing: convert the text into a corpus, tokenize, remove punctuation, numbers, symbols, URLs, and stopwords, convert to lower case, and lemmatize.

story_corpus <- corpus(NoSleep$text_clean)
Warning: NA is replaced by empty string
# Removing apostrophes before tokenization
story_corpus <- gsub("'", "", story_corpus)

story_token <- tokens(story_corpus,
                   remove_punct = T,
                   remove_symbols = T,
                   remove_numbers = T,
                   remove_url = T,
                   split_hyphens = F,
                   include_docvars = T) %>%
  tokens_tolower()

# remove stopwords
story_token <- tokens_select(story_token,
                             pattern = c(stopwords("en"), "t", "s", "re", "im",
                                         "wasnt", "didnt"),
                             selection = "remove")

# lemmatization (I may not use this version)
story_token_lem <- tokens_replace(story_token,
                                  pattern = lexicon::hash_lemmas$token,
                                  replacement = lexicon::hash_lemmas$lemma)

# create document feature matrix
storyDfm <- story_token_lem %>%
                   dfm()
storyDfm
Document-feature matrix of: 958 documents, 25,956 features (97.78% sparse) and 0 docvars.
       features
docs    husband foot fetish even tell first date despite fact never
  text1      11   14      3    4    3     1    4       1    3     4
  text2       0    2      0    7    5     1    0       1    0     3
  text3       0    0      0    1    0     1    0       0    1     0
  text4       0    4      0    9    4     7    0       2    0     7
  text5       1    0      0    6    8     1    2       0    1     6
  text6       0    0      0    7    0     2    0       1    0     0
[ reached max_ndoc ... 952 more documents, reached max_nfeat ... 25,946 more features ]
topfeatures(storyDfm,50)
     like        go       get      just       see      look      back       say 
    10628     10191      9986      9784      9456      8824      8787      8308 
     know       one      feel      make      time      come       eye something 
     7969      7555      7027      6603      6147      6039      6033      5977 
    think      take      door      have      good       try    around     leave 
     5815      5492      5342      5264      5097      4922      4613      4581 
       us      even     thing      hear       now      tell      hand      didn 
     4521      4507      4415      4295      4146      4092      4089      4071 
     find     still      room     start       can     night       way      turn 
     4067      4024      3982      3873      3849      3800      3793      3787 
     face      want      head       day      open     house      away       ask 
     3766      3712      3664      3477      3384      3306      3276      3261 
      man      long 
     3177      3169 

We see that the top features are not horrifying at all… so a simple word cloud may not be very informative.
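Still, since wordcloud and RColorBrewer are already loaded, here is a quick sketch of what that cloud would look like, built from the top 100 features of storyDfm (the seed and palette choices are illustrative, not part of the original analysis):

```r
# Sketch: word cloud of the 100 most frequent lemmas (mostly mundane verbs)
set.seed(1031)  # for a reproducible layout
top100 <- topfeatures(storyDfm, 100)
wordcloud(words = names(top100), freq = top100,
          min.freq = 1, random.order = FALSE,
          colors = brewer.pal(8, "Dark2"))
```

As expected, generic narrative verbs like "like", "go", and "get" dominate the cloud, which motivates the targeted monster search below.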

Find the Monsters

What are those THINGS that scare people? We first develop a list of seed words, such as vampire, werewolf, ghost, and skeleton, then use pre-trained GloVe embeddings to find words similar to the seed words.

GloVe <- fread("glove.6B.50d.txt", header = FALSE, sep = " ", quote = "")
GloVe <- as.data.frame(GloVe)
GloVe[1:6,1:6]
   V1       V2        V3       V4        V5      V6
1 the 0.418000  0.249680 -0.41242  0.121700 0.34527
2   , 0.013441  0.236820 -0.16899  0.409510 0.63812
3   . 0.151640  0.301770 -0.16763  0.176840 0.31719
4  of 0.708530  0.570880 -0.47160  0.180480 0.54449
5  to 0.680470 -0.039263  0.30186 -0.177920 0.42962
6 and 0.268180  0.143460 -0.27877  0.016257 0.11384
#use the first column (the word) as row names
rownames(GloVe) <- GloVe[[1]]
GloVe <- GloVe[, -1]
GloVe[1:6,1:6]
          V2        V3       V4        V5      V6        V7
the 0.418000  0.249680 -0.41242  0.121700 0.34527 -0.044457
,   0.013441  0.236820 -0.16899  0.409510 0.63812  0.477090
.   0.151640  0.301770 -0.16763  0.176840 0.31719  0.339730
of  0.708530  0.570880 -0.47160  0.180480 0.54449  0.726030
to  0.680470 -0.039263  0.30186 -0.177920 0.42962  0.032246
and 0.268180  0.143460 -0.27877  0.016257 0.11384  0.699230

Now we have loaded the pre-trained GloVe vectors. Let’s move on to computing cosine similarity against the list of seed words.

# Change GloVe into a matrix as sim2() function expects x and y to be matrices
GloVe <- as.matrix(GloVe)

# List of seed words
seed_words <- c("vampire", "werewolf", "zombie", "ghost",  "witch", "goblin", "demon", "skeleton", "clown")

# Initialize an empty list to store the results
similar_words <- list()

# Loop through each seed word
for (seed in seed_words) {
  # Check if the seed word exists in GloVe
  if (seed %in% rownames(GloVe)) {
    # Get the vector for the seed word
    word_vec <- GloVe[seed, , drop = FALSE]
    
    # Calculate cosine similarity
    sim <- sim2(x = GloVe, y = word_vec, method = "cosine", norm = "l2")
    
    # Sort the results and get the top 20 most similar words
    top_similar <- head(sort(sim[, 1], decreasing = TRUE), 20)
    
    # Store the results in the list
    similar_words[[seed]] <- top_similar
  } else {
    # If the word is not found, print a message
    print(paste(seed, "not found in GloVe."))
  }
}

# Output the results for each seed word
for (seed in seed_words) {
  if (!is.null(similar_words[[seed]])) {
    cat("\nTop words similar to:", seed, "\n")
    print(similar_words[[seed]])
  }
}

Top words similar to: vampire 
    vampire    werewolf       beast       witch    vampires      slayer 
  1.0000000   0.8408670   0.7993797   0.7866037   0.7734474   0.7612962 
    villain      zombie   superhero protagonist       ghost      spider 
  0.7413340   0.7328219   0.7222879   0.7191358   0.7184119   0.6992059 
      demon     monster      batman       buffy       fairy  spider-man 
  0.6965436   0.6942203   0.6932109   0.6898299   0.6892287   0.6888385 
    mystery     dracula 
  0.6837056   0.6836893 

Top words similar to: werewolf 
  werewolf    vampire   vampires     zombie werewolves     slayer      witch 
 1.0000000  0.8408670  0.7283907  0.7211610  0.7036923  0.7034839  0.6899053 
   warlock      beast  superhero     mutant     kitten    zombies    villain 
 0.6823281  0.6816036  0.6710823  0.6693557  0.6654401  0.6641041  0.6485555 
    undead     daleks   magician      bonzo    unicorn      demon 
 0.6466679  0.6433686  0.6424505  0.6407809  0.6395843  0.6379377 

Top words similar to: zombie 
    zombie    zombies      beast    monster     undead    vampire   werewolf 
 1.0000000  0.8129889  0.7843416  0.7740901  0.7588871  0.7328219  0.7211610 
  monsters     slayer terminator  superhero apocalypse      ghost    cartoon 
 0.7013510  0.6980167  0.6864493  0.6863230  0.6847633  0.6827515  0.6807979 
  punisher     killer     psycho   animated     flicks        bug 
 0.6753380  0.6723068  0.6717403  0.6676556  0.6634084  0.6628352 

Top words similar to: ghost 
     ghost      beast    monster   stranger   paradise   creature    haunted 
 1.0000000  0.8085173  0.7932728  0.7651840  0.7383574  0.7348251  0.7309647 
 fictional   monsters    vampire adventures      tales     lonely    mystery 
 0.7249350  0.7227428  0.7184119  0.7162226  0.7155273  0.7112124  0.7107794 
    ghosts       tale  adventure       hell mysterious        cat 
 0.7107621  0.7096840  0.7074680  0.7069266  0.7018905  0.6978224 

Top words similar to: witch 
    witch   vampire     beast    wicked  werewolf    monkey      evil    rabbit 
1.0000000 0.7866037 0.7306837 0.6955752 0.6899053 0.6857144 0.6844851 0.6772360 
    ghost  vampires       mad       dog   monster      tale       ape       cat 
0.6762089 0.6753860 0.6708003 0.6692684 0.6670857 0.6668315 0.6642747 0.6610318 
  witches     fairy    killer     snake 
0.6567832 0.6467018 0.6463133 0.6426499 

Top words similar to: goblin 
    goblin       puss  sorceress      demon  hobgoblin     dorian       mage 
 1.0000000  0.6930484  0.6864744  0.6827572  0.6785817  0.6771488  0.6746703 
     raven   rakshasa spider-man       ogre      darth     ravana    unicorn 
 0.6673271  0.6567830  0.6561245  0.6557366  0.6534529  0.6479248  0.6467908 
    wraith  chameleon  pretender     sauron  voldemort      vader 
 0.6449909  0.6339083  0.6325631  0.6323434  0.6304938  0.6295239 

Top words similar to: demon 
     demon     demons     dragon      beast spider-man     undead      curse 
 1.0000000  0.7878809  0.7491889  0.7392522  0.7283558  0.7276703  0.7158713 
   demonic    etrigan       mage    vampire       evil   creature   monsters 
 0.7043070  0.6991201  0.6975241  0.6965436  0.6939673  0.6899178  0.6897230 
    wizard     goblin   immortal     mortal  voldemort    lucifer 
 0.6830256  0.6827572  0.6783764  0.6773456  0.6736261  0.6721883 

Top words similar to: skeleton 
   skeleton       skull   skeletons       bones  fossilized      skulls 
  1.0000000   0.7348896   0.7211467   0.6956670   0.6695362   0.6685425 
postcranial    specimen    headless  footprints    skeletal    dinosaur 
  0.6663190   0.6639125   0.6633926   0.6522741   0.6401934   0.6391038 
   mastodon       vault    holotype     encased   unearthed        body 
  0.6353807   0.6218795   0.6174828   0.6046492   0.6006260   0.5941755 
  vertebrae   mummified 
  0.5913393   0.5876159 

Top words similar to: clown 
       clown      costume        posse       psycho       circus       clowns 
   1.0000000    0.6975169    0.6902777    0.6862030    0.6827969    0.6801673 
       doink transvestite   dominatrix    pantomime          cop   striptease 
   0.6607320    0.6602358    0.6599859    0.6573607    0.6557975    0.6546285 
   burlesque      dancing    mannequin         cage       scream         sexy 
   0.6491192    0.6485157    0.6417197    0.6392810    0.6381918    0.6354160 
     dressed         doll 
   0.6354121    0.6353050 

WOW, that’s a long list of scary creatures, with plenty of overlap. Let’s extract the unique words and narrow them down to the horror-story-related ones.

unique_words <- unique(unlist(lapply(similar_words, names)))

unique_words
  [1] "vampire"      "werewolf"     "beast"        "witch"        "vampires"    
  [6] "slayer"       "villain"      "zombie"       "superhero"    "protagonist" 
 [11] "ghost"        "spider"       "demon"        "monster"      "batman"      
 [16] "buffy"        "fairy"        "spider-man"   "mystery"      "dracula"     
 [21] "werewolves"   "warlock"      "mutant"       "kitten"       "zombies"     
 [26] "undead"       "daleks"       "magician"     "bonzo"        "unicorn"     
 [31] "monsters"     "terminator"   "apocalypse"   "cartoon"      "punisher"    
 [36] "killer"       "psycho"       "animated"     "flicks"       "bug"         
 [41] "stranger"     "paradise"     "creature"     "haunted"      "fictional"   
 [46] "adventures"   "tales"        "lonely"       "ghosts"       "tale"        
 [51] "adventure"    "hell"         "mysterious"   "cat"          "wicked"      
 [56] "monkey"       "evil"         "rabbit"       "mad"          "dog"         
 [61] "ape"          "witches"      "snake"        "goblin"       "puss"        
 [66] "sorceress"    "hobgoblin"    "dorian"       "mage"         "raven"       
 [71] "rakshasa"     "ogre"         "darth"        "ravana"       "wraith"      
 [76] "chameleon"    "pretender"    "sauron"       "voldemort"    "vader"       
 [81] "demons"       "dragon"       "curse"        "demonic"      "etrigan"     
 [86] "wizard"       "immortal"     "mortal"       "lucifer"      "skeleton"    
 [91] "skull"        "skeletons"    "bones"        "fossilized"   "skulls"      
 [96] "postcranial"  "specimen"     "headless"     "footprints"   "skeletal"    
[101] "dinosaur"     "mastodon"     "vault"        "holotype"     "encased"     
[106] "unearthed"    "body"         "vertebrae"    "mummified"    "clown"       
[111] "costume"      "posse"        "circus"       "clowns"       "doink"       
[116] "transvestite" "dominatrix"   "pantomime"    "cop"          "striptease"  
[121] "burlesque"    "dancing"      "mannequin"    "cage"         "scream"      
[126] "sexy"         "dressed"      "doll"        

Now we have our list: vampire, werewolf, beast, witch, zombie, ghost, demon, monster, killer, stranger, goblin, hobgoblin, rakshasa, ogre, wraith, wizard, lucifer, skeleton, skull, clown, and doll.

We want to know which scary things appear most often in the stories.

# Convert storyDfm into a data frame
full_df <- convert(storyDfm,to="data.frame")

# Things you see
things <- c("vampire", "werewolf", "beast", "witch", "zombie", "ghost", "demon", "monster", "killer", "stranger", "goblin", "hobgoblin", "rakshasa", "ogre", "wraith", "wizard", "lucifer", "skeleton", "skull", "clown", "doll")

# Filter 'things' to only include words that appear in the corpus (columns in full_df)
existing_things <- things[things %in% colnames(full_df)]

full_things <- full_df[,existing_things]

full_things2 <- as.data.frame(t(full_things)) 

full_things2 <- full_things2 %>% 
  mutate(count = rowSums(.))

full_things2 <- tibble::rownames_to_column(full_things2, "things")

top_things <- head(full_things2 %>% arrange(desc(count)),10)

top_things <- top_things %>% 
  mutate(things = reorder(things, -count))


ggplot(top_things, aes(x=things, y=count)) +
  geom_bar(stat="identity")

Words that Co-Occur with the Top-10 Monsters

Now that we have found the top-10 monsters, let’s check out the co-occurrence matrices of these THINGS. They will give us more information: for example, where the monsters are, what they look like, and when they appear.

ten_things <- as.character(top_things$things)

# Loop through each "thing" to create co-occurrence matrices
even_smaller_fcm_list <- list()

for (i in ten_things) {
  # Select tokens that contain each "thing" and its surrounding words
  context <- tokens_select(story_token_lem, pattern = i, window = 10, selection = "keep")
  # Create a feature co-occurrence matrix (FCM)
  fcm_matrix <- fcm(context, context = "window")
  # pull the top features
  top_features <- names(sort(colSums(fcm_matrix), decreasing = TRUE)[1:30]) 
  even_smaller_fcm <- fcm_select(fcm_matrix, pattern = c(i, top_features))
  
  #store it
  even_smaller_fcm_list[[i]] <- even_smaller_fcm
}

Now we make the network graphs for co-occurrence.

textplot_network(even_smaller_fcm_list[[1]])

textplot_network(even_smaller_fcm_list[[2]])

textplot_network(even_smaller_fcm_list[[3]])

textplot_network(even_smaller_fcm_list[[4]])

textplot_network(even_smaller_fcm_list[[5]])

textplot_network(even_smaller_fcm_list[[6]])

textplot_network(even_smaller_fcm_list[[7]])

textplot_network(even_smaller_fcm_list[[8]])

textplot_network(even_smaller_fcm_list[[9]])

textplot_network(even_smaller_fcm_list[[10]])

What Do Those THINGS Do?

Now we want to find the top verbs associated with the THINGS. We first reshape the corpus to the sentence level and keep only the sentences that mention a THING.

#first reshape to sentence level
sentence_corpus <- corpus_reshape(story_corpus, to="sentences")
text_sentence <- as.character(sentence_corpus)

# Create a regex pattern to match any word in the 'things' list
pattern <- paste(things, collapse = "|")

# Filter sentences to only those containing any of the words in 'things'
things_sentences <- text_sentence[grepl(pattern, text_sentence, ignore.case = TRUE)]
things_sentences[1:6]
                                                                                                                                                                                                                                                         text2.6 
                                                                                                                                                                                          "Without wasting any more time, this is how the nightmare progressed." 
                                                                                                                                                                                                                                                         text3.5 
                                                                                                                                                               "I recently popped out a story about a sort of grey laziness-ghost for one of the horror forums." 
                                                                                                                                                                                                                                                        text3.13 
                                                           "My story had my character being scared of the lazy ghost, but re-reading showed me that it was more about the procrastination than the fear and it could honestly have done with him reacting more." 
                                                                                                                                                                                                                                                        text3.22 
                                                                             "I lashed out a little, editing the work to make the MC almost comically terrified, screaming and gibbering at the sight of a dumb ghost that definitely didnt merit the reaction." 
                                                                                                                                                                                                                                                        text3.52 
                                                                                                                                                                                  "It passed straight harmlessly through, striking the wall behind the monster." 
                                                                                                                                                                                                                                                        text5.27 
"I smiled at them as I attempted to walk around them as I knew to not make small talk with strangers...yes we were taught that even back in those days But as I tried to walk around them, I felt hands grab me from behind as I was then thrown to the ground." 
length(things_sentences)
[1] 2904
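One caveat: a plain alternation in grepl matches substrings, so any term embedded inside a longer, unrelated word would also match. A stricter sketch (pattern_strict and things_sentences_strict are names introduced here, assuming things and text_sentence exist as above) anchors each term at word boundaries:

```r
# Anchor each monster term at word boundaries so only whole words match
pattern_strict <- paste0("\\b(", paste(things, collapse = "|"), ")\\b")
things_sentences_strict <- text_sentence[grepl(pattern_strict, text_sentence,
                                               ignore.case = TRUE)]
```

Note the trade-off: word boundaries also exclude plural forms like "ghosts", so you may want to allow optional suffixes depending on how the things list is built.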

Now we annotate the sentences.

cnlp_init_udpipe()
#Annotate POS
annotated <- cnlp_annotate(things_sentences)
Processed document 10 of 2904
...
Processed document 2900 of 2904
head(annotated$token)
# A tibble: 6 × 11
  doc_id    sid tid   token   token_with_ws lemma   upos  xpos  feats tid_source
  <chr>   <int> <chr> <chr>   <chr>         <chr>   <chr> <chr> <chr> <chr>     
1 text2.6     1 1     Without "Without "    without SCONJ IN    <NA>  2         
2 text2.6     1 2     wasting "wasting "    waste   VERB  VBG   Verb… 11        
3 text2.6     1 3     any     "any "        any     DET   DT    <NA>  5         
4 text2.6     1 4     more    "more "       more    ADJ   JJR   Degr… 5         
5 text2.6     1 5     time    "time"        time    NOUN  NN    Numb… 2         
6 text2.6     1 6     ,       ", "          ,       PUNCT ,     <NA>  11        
# ℹ 1 more variable: relation <chr>

To find the verbs linked to the THINGS, we join the token table with itself on the dependency head column (tid_source).

tokens <- annotated$token
# Join each token with its dependents: match rows whose dependency head (tid_source) is this token
joined_POS <- annotated$token %>%
  left_join(
    annotated$token,
    by = c("doc_id" = "doc_id", "sid" = "sid", "tid" = "tid_source"),  # tid_source points to the head token's tid
    suffix = c("", "_next")
  ) %>%
  filter(lemma %in% things) %>%          # Keep rows where the token is a "thing"
  filter(upos_next == "VERB")            # ...and the dependent is a verb

head(joined_POS)
# A tibble: 6 × 19
  doc_id       sid tid   token  token_with_ws lemma upos  xpos  feats tid_source
  <chr>      <int> <chr> <chr>  <chr>         <chr> <chr> <chr> <chr> <chr>     
1 text3.22       1 27    ghost  "ghost "      ghost NOUN  NN    Numb… 23        
2 text20.165     1 15    monst… "monster "    mons… NOUN  NN    Numb… 9         
3 text23.417     1 17    monst… "monster "    mons… NOUN  NN    Numb… 30        
4 text28.14      1 17    killer "killer"      kill… NOUN  NN    Numb… 0         
5 text30.16      1 2     stran… "strangers "  stra… NOUN  NNS   Numb… 0         
6 text30.184     1 6     stran… "strangers "  stra… NOUN  NNS   Numb… 3         
# ℹ 9 more variables: relation <chr>, tid_next <chr>, token_next <chr>,
#   token_with_ws_next <chr>, lemma_next <chr>, upos_next <chr>,
#   xpos_next <chr>, feats_next <chr>, relation_next <chr>
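Worth noting: because the join matches on tid_source, the "_next" token is a syntactic dependent of the THING rather than the literally following word. A quick sketch to see which dependency relations are being captured (assuming joined_POS from the chunk above):

```r
# Tally the dependency relations that link THINGS to their verbs
joined_POS %>%
  count(relation_next, sort = TRUE)
```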

Now we can count the verbs and make a bar plot.

verb_counts <- joined_POS %>%
  select(verb = lemma_next) %>%          # Select the following verb
  count(verb, sort = TRUE) %>%           # Count occurrences of each verb
  as.data.frame()      

verb_counts <- verb_counts %>% 
  mutate(verb = reorder(verb, -n))

top_verb <- verb_counts[1:15,]

ggplot(top_verb, aes(x=verb, y=n)) +
  geom_bar(stat="identity")+
  labs(x="Verbs",y="Count")

Words to Describe Things

Now that we have found the actions of the THINGS, what adjectives do storytellers use to describe them?

joined_POS_adj <- annotated$token %>%
  # Join with itself to access the adjectives (source tokens) for each "thing"
  left_join(
    annotated$token,
    by = c("doc_id" = "doc_id", "sid" = "sid", "tid" = "tid_source"),
    suffix = c("", "_source")
  ) %>%
  # Filter for rows where the token is in "things" and the modifier is an adjective
  filter(lemma %in% things) %>%            # Filter rows where the token is a "thing"
  filter(relation_source == "amod" & upos_source == "ADJ")     

head(joined_POS_adj)
# A tibble: 6 × 19
  doc_id     sid tid   token   token_with_ws lemma  upos  xpos  feats tid_source
  <chr>    <int> <chr> <chr>   <chr>         <chr>  <chr> <chr> <chr> <chr>     
1 text3.13     1 11    ghost   "ghost"       ghost  NOUN  NN    Numb… 7         
2 text3.22     1 27    ghost   "ghost "      ghost  NOUN  NN    Numb… 23        
3 text7.81     1 11    vampire "vampire"     vampi… NOUN  NN    Numb… 8         
4 text8.37     1 67    doll    "doll "       doll   NOUN  NN    Numb… 51        
5 text8.37     1 92    doll    "doll "       doll   NOUN  NN    Numb… 6         
6 text8.83     1 9     dolls   "dolls"       doll   NOUN  NNS   Numb… 5         
# ℹ 9 more variables: relation <chr>, tid_source_source <chr>,
#   token_source <chr>, token_with_ws_source <chr>, lemma_source <chr>,
#   upos_source <chr>, xpos_source <chr>, feats_source <chr>,
#   relation_source <chr>
adj_counts <- joined_POS_adj %>%
  select(adj = lemma_source) %>%        # Select the modifying adjective
  count(adj, sort = TRUE) %>%           # Count occurrences of each adjective
  as.data.frame()      

adj_counts <- adj_counts %>% 
  mutate(adj = reorder(adj, -n))

top_adj <- adj_counts[1:15,]

ggplot(top_adj, aes(x=adj, y=n)) +
  geom_bar(stat="identity")+
  labs(x="Adjectives",y="Count")+
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Hmm… some of the adjectives don't make sense. Why is there "serial"? Why is "human" an adjective for scary things? Let's figure it out.

adj_thing_pair <- joined_POS_adj %>%
  select(thing = lemma, adjective = lemma_source) %>%
  filter(!is.na(adjective)) %>%    # Exclude missing adjectives
  count(thing, adjective, sort = TRUE) %>%  # Count occurrences of each adj-thing pair
  as.data.frame()

plot_data <- adj_thing_pair %>%
  mutate(label = paste(adjective, thing, sep = "-")) %>%  # Create labels like "serial-killer"
  arrange(desc(n)) %>%  
  slice_head(n = 15) %>% 
  mutate(label = reorder(label, -n))

ggplot(plot_data, aes(x=label, y=n)) +
  geom_bar(stat="identity")+
  labs(x="Adj-Thing Pair",y="Count")+
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Now serial-killer, human-skull, and complete-stranger make more sense.

Most Horrifying Stories

Now we want to use sentiment analysis with the NRC dictionary to detect the most horrifying stories. Here I use the liwcalike() function so the scores account for the length of each story.

story_sen <- liwcalike(story_corpus,data_dictionary_NRC)

head(story_sen,10)
   docname Segment      WPS   WC Sixltr   Dic anger anticipation disgust fear
1    text1       1 18.29730 1354  11.60 13.66  1.18         1.03    1.03 1.26
2    text2       2 14.25000 2679   9.00 16.35  0.86         1.94    1.08 1.42
3    text3       3 18.74138 1087  16.38 21.71  2.48         1.56    1.66 3.50
4    text4       4 12.95924 4769  11.97 20.40  1.09         2.10    0.71 1.47
5    text5       5 34.04444 1532   8.42 17.49  1.37         1.70    0.85 1.70
6    text6       6 15.19565 2097  15.78 18.55  1.00         2.38    0.86 2.10
7    text7       7 13.11382 1613  12.09 13.33  1.05         0.87    1.74 1.55
8    text8       8 19.01724 2206  11.83 14.87  1.31         1.72    0.59 1.90
9    text9       9 13.43103  779  12.45 15.92  0.64         1.16    0.51 1.93
10  text10      10 11.10849 2355  12.14 14.86  1.23         1.10    0.59 2.21
    joy negative positive sadness surprise trust AllPunc Period Comma Colon
1  0.81     2.66     2.22    1.18     0.96  1.33   17.36   8.12  6.43  0.00
2  1.23     3.02     2.87    1.38     0.67  1.87   11.68   6.72  2.09  0.07
3  0.64     4.51     2.58    1.66     0.83  2.30   14.26   5.43  4.69  0.28
4  1.93     3.19     3.96    2.05     1.07  2.83    9.86   7.36  1.80  0.00
5  1.70     2.55     2.74    2.09     0.98  1.83    7.77   4.83  2.74  0.00
6  1.38     2.91     3.00    2.00     0.91  2.00   15.21   6.68  5.05  0.29
7  0.74     3.22     1.49    0.93     0.81  0.93   16.31   6.45  4.28  0.06
8  1.18     2.31     2.45    1.18     0.59  1.63    8.20   4.94  1.31  0.09
9  1.54     3.34     2.82    0.90     0.90  2.18   18.61   7.96  9.24  0.00
10 0.55     2.93     1.91    2.08     0.42  1.83   17.62   8.28  5.69  0.17
   SemiC QMark Exclam Dash Quote Apostro Parenth OtherP
1    0.0  0.52   0.15 0.07  2.07       0    0.00  17.28
2    0.0  0.22   0.30 0.60  0.15       0    0.15  10.15
3    0.0  0.64   0.09 0.46  1.38       0    0.00  12.51
4    0.0  0.55   0.06 0.00  0.00       0    0.00   9.86
5    0.0  0.00   0.07 0.13  0.00       0    0.00   7.64
6    0.1  0.48   0.14 0.19  0.29       0    0.00  14.12
7    0.0  0.43   0.99 0.37  3.72       0    0.00  15.93
8    0.0  0.32   0.05 0.05  1.27       0    0.09   7.98
9    0.0  0.13   0.00 0.00  1.28       0    0.00  18.61
10   0.0  0.55   0.34 0.55  0.00       0    0.08  16.52
ggplot(story_sen)+
  geom_histogram(aes(x=fear))+
  theme_bw()
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

quantile(story_sen$fear)
  0%  25%  50%  75% 100% 
0.00 1.35 1.73 2.17 6.10 

The 75th percentile of fear is 2.17. Let's look at stories with a fear score of 2 or higher.

fear_corpus <- story_corpus[which(story_sen$fear >= 2)]
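The hard-coded cutoff of 2 works, but the threshold could also be derived from the data. A sketch under that assumption (fear_threshold and fear_corpus_q are names introduced here):

```r
# Use the 75th percentile as a data-driven cutoff instead of a fixed value
fear_threshold <- quantile(story_sen$fear, 0.75)
fear_corpus_q <- story_corpus[which(story_sen$fear >= fear_threshold)]
```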

Now we make a word cloud for the high-fear corpus.

# remove punctuations, numbers, symbols, etc, and to lower case
fear_token <- tokens(fear_corpus,
                     remove_punct = T,
                     remove_symbols = T,
                     remove_numbers = T,
                     remove_url = T,
                     split_hyphens = F,
                     include_docvars = T) %>%
  tokens_tolower()
# remove stopwords
fear_token <- tokens_select(fear_token,
                            pattern = c(stopwords("en"), "t", "s", "re", "im", "2",
                                        "wasn", "didn", "be", "id", "couldnt", "fuck"),
                            selection = "remove")

# lemmatization
fear_token_lem <- tokens_replace(fear_token,
                                 pattern = lexicon::hash_lemmas$token,
                                 replacement = lexicon::hash_lemmas$lemma)

# create document feature matrix
fearDfm <- fear_token_lem %>%
  dfm()

set.seed(1234)
textplot_wordcloud(fearDfm, 
                   min_count = 20,
                   random_order = FALSE)

This looks good, but a bit messy. We want to reduce the information in the word cloud, so we trim the dfm to keep only terms that appear in 50% or fewer of the documents.

smaller_fearDfm <- dfm_trim(fearDfm, max_docfreq = .5, docfreq_type = "prop")

# Make wordcloud
set.seed(1234)
textplot_wordcloud(smaller_fearDfm, 
                   min_count = 20,
                   max_words = 100,
                   random_order = FALSE)
Warning in wordcloud(x, min_size, max_size, min_count, max_words, color, :
parent could not be fit on page. It will not be plotted.
Warning in wordcloud(x, min_size, max_size, min_count, max_words, color, :
table could not be fit on page. It will not be plotted.
Warning in wordcloud(x, min_size, max_size, min_count, max_words, color, :
shoulder could not be fit on page. It will not be plotted.

Now we see words like creature, shadow, darkness, whisper, etc. This looks more horror-story related.

Topic Modeling

Now let's check out what topics appear in the horror stories, starting with the full corpus.

# full dfm (doesn't make sense)
full_topic_5 <- stm(storyDfm, K = 5,
                  verbose = FALSE, init.type = "Spectral")
Warning in dfm2stm(x, docvars, omit_empty = TRUE): Dropped 958 empty
document(s)
plot(full_topic_5,type="summary")

full_topic_10 <- stm(storyDfm, K = 10,
                  verbose = FALSE, init.type = "Spectral")
Warning in dfm2stm(x, docvars, omit_empty = TRUE): Dropped 958 empty
document(s)
plot(full_topic_10,type="summary")

The topics from the full corpus don't make much sense; they aren't horrifying at all. Let's look into the fear corpus instead.

fear_topic5 <- stm(fearDfm, K = 5,
                       verbose = FALSE, init.type = "Spectral")
Warning in dfm2stm(x, docvars, omit_empty = TRUE): Dropped 958 empty
document(s)
plot(fear_topic5,type="summary")

fear_topic10 <- stm(fearDfm, K = 10,
                       verbose = FALSE, init.type = "Spectral")
Warning in dfm2stm(x, docvars, omit_empty = TRUE): Dropped 958 empty
document(s)
plot(fear_topic10,type="summary")

I don't like this result either… Let's try the trimmed fear dfm.

small_fear_topic5 <- stm(smaller_fearDfm, K = 5,
                         verbose = FALSE, init.type = "Spectral")
Warning in dfm2stm(x, docvars, omit_empty = TRUE): Dropped 958 empty
document(s)
plot(small_fear_topic5,type="summary")

It seems the top topic (whisper, shadow, dark) is the most horrifying. Let's make a word cloud for the stories that contribute to this topic.

#Get document-topic probabilities
topic_probs <- small_fear_topic5$theta

#Identify documents where Topic 2's probability exceeds a threshold
threshold <- 0.5  
topic2_docs <- which(topic_probs[, 2] > threshold)

#Subset the DFM to only include documents that contribute strongly to Topic2
topic2_fearDfm <- smaller_fearDfm[topic2_docs, ]

#Wordcloud
set.seed(1234)
textplot_wordcloud(topic2_fearDfm, 
                   max_words = 100, 
                   min_count = 20, 
                   random_order = FALSE)
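The 0.5 threshold is fairly strict; an alternative sketch assigns each document to its single most probable topic instead (nonempty_fearDfm, doc_top_topic, and the *_alt names are introduced here). One caution: stm reported dropping 958 empty documents, so theta's rows correspond to the non-empty documents only, and the dfm should be aligned before subsetting by row index:

```r
# theta rows correspond to non-empty documents (stm dropped the empty ones),
# so restrict the dfm to documents with at least one token first
nonempty_fearDfm <- smaller_fearDfm[ntoken(smaller_fearDfm) > 0, ]

# Assign each document to its highest-probability topic, then subset
doc_top_topic <- max.col(topic_probs)
topic2_docs_alt <- which(doc_top_topic == 2)
topic2_fearDfm_alt <- nonempty_fearDfm[topic2_docs_alt, ]
```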